Abstract

Background: The miRNAs, a class of short approximately 22-nucleotide non-coding RNAs, often act
post-transcriptionally to inhibit mRNA expression. In effect, they control gene expression by targeting mRNA. They
also help in carrying out normal functioning of a cell as they play an important role in various cellular processes. 
However, dysregulation of miRNAs is found to be a major cause of disease. It has been demonstrated that miRNA
expression is altered in many human cancers, suggesting that miRNAs may play an important role as disease biomarkers.
Multiple reports have also noted the utility of miRNAs for the diagnosis of cancer. Among the large number of
miRNAs present in microarray data, a modest number might be sufficient to classify human cancers. Hence, the
identification of differentially expressed miRNAs is an important problem, particularly for data sets with a large
number of miRNAs and a small number of samples.

Results: In this regard, a new miRNA selection algorithm, called μHEM, is presented based on the rough hypercuboid
approach. It selects a set of miRNAs from microarray data by maximizing both relevance and significance of the
selected miRNAs. The degree of dependency of sample categories on miRNAs is defined, based on the concept of
hypercuboid equivalence partition matrix, to measure both relevance and significance of miRNAs. The effectiveness of
the new approach is demonstrated on six publicly available miRNA expression data sets using the support vector
machine. The B.632+ bootstrap error estimate is used to minimize the variability and biasedness of the derived results.

Conclusions: An important finding is that the μHEM algorithm achieves the lowest B.632+ error rate of the support vector
machine with a reduced set of differentially expressed miRNAs on four expression data sets compared to some existing
machine learning and statistical methods, while for the other two data sets, the error rate of the μHEM algorithm is
comparable with that of the existing techniques. The results on several microarray data sets demonstrate that the proposed
method can bring a remarkable improvement on the miRNA selection problem. The method is a potentially useful tool for
exploration of miRNA expression data and identification of differentially expressed miRNAs worth further investigation.

Keywords: MicroRNA, Feature selection, Rough hypercuboid, Bootstrap error, Support vector machine



Background 

The microRNAs or miRNAs are small non-coding RNAs 
of length around 22 nucleotides, present in many plants 
and animals. They repress the expression of a gene post- 
transcriptionally. In effect, they regulate expression of 
a gene or protein. The miRNAs are related to diverse 
cellular processes and are regarded as important components
of gene regulatory networks. Studies into miRNA function
have mainly focused on a variety of human diseases, par- 
ticularly cancer, and mainly related to the use of miRNAs 
as disease biomarkers and for monitoring drug efficacy. 
Multiple reports have noted the utility of miRNAs for the 
diagnosis of cancer and other diseases [1]. 

Unlike with mRNAs, a modest number of miRNAs 
might be sufficient to classify human cancers [1]. Moreover,
the bead-based miRNA detection method has the
attractive property of being not only accurate and specific,
but also easy to implement in a routine clinical setting.
In addition, unlike mRNAs, miRNAs remain largely intact 
in routinely collected, formalin-fixed, paraffin-embedded 
clinical tissues [2]. Recent studies have also shown that 
miRNAs can be detected in serum. These studies offer 
the promise of utilizing miRNA screening via less invasive 
blood-based mechanisms. In addition, mature miRNAs 
are relatively stable. These phenomena make miRNAs 
superior molecular markers and targets for interrogation 
and as such, miRNA expression profiling can be utilized
as a tool for the diagnosis of cancer and other diseases.

The functions of miRNAs appear to differ across
various cellular processes. Just as miRNAs are involved in
the normal functioning of eukaryotic cells, so has dysregulation
of miRNAs been associated with disease [3]. This
indicates that these miRNAs can prove to be potential
biomarkers for developing a diagnostic tool. Hence, in silico
identification of differentially expressed miRNAs that
target genes involved in diseases is necessary. These differentially
expressed miRNAs can be further used in developing
effective diagnostic tools. Recently, a few studies have been
carried out to identify differentially expressed miRNAs
[4-9]. However, the absence of a robust method makes it an
open problem.

A miRNA expression data set can be represented by an 
expression table or matrix, where each row corresponds 
to one particular miRNA, each column to a sample, and 
each entry of the matrix is the measured expression level 
of a particular miRNA in a sample. However,
for microarray data, the number of training samples is typically
very small, while the number of miRNAs is in the
thousands. Hence, it may not be possible to form the
prediction rule of a classifier using all available
miRNAs. Even if all the miRNAs can be used, doing so
admits the noise associated with miRNAs
of little or no discriminatory power, which
degrades the performance of the prediction rule in its
application to unclassified or test samples. In other words,
although the apparent error rate, which is the propor- 
tion of the training samples misclassified by the prediction 
rule, will decrease as it is formed from more and more 
miRNAs, its error rate in classifying samples outside of 
the training set eventually will increase. That is, the gen- 
eralization error of the prediction rule will be increased if 
it is formed from a sufficiently large number of miRNAs. 
Hence, in practice, consideration has to be given to implementing
some procedure of feature selection for reducing
the number of miRNAs to be used in constructing the
prediction rule [10].

The method called significance analysis of microar- 
rays is used in several works [11-16] to identify dif- 
ferentially expressed miRNAs. Different statistical tests 
are also employed to identify differentially expressed 
miRNAs [1,4-8,17-20]. Xu et al. [21] used a particle swarm
optimization technique for selecting important miRNAs
that contribute to the discrimination of different cancer
types. However, one of the main problems in miRNA
expression data analysis is uncertainty. Some of the
sources of this uncertainty include imprecision in computations
and vagueness in class definition. In this background,
rough set theory has gained popularity
in modeling and propagating uncertainty. It deals with
vagueness and incompleteness and is proposed for indiscernibility
in classification according to some similarity
[22]. It has been applied successfully to feature selection
of discrete valued data [23]. Given a data set with discretized
attribute values, it is possible to find, using rough set
theory, a subset of the original attributes that are
the most informative; all other attributes can be removed
from the data set with minimal information loss. The
theory of rough sets has also been successfully applied to
microarray data analysis [9,24-35].

However, a real-life high dimensional microarray data
set may contain a number of irrelevant and insignificant
miRNAs [9]. The presence of such miRNAs may lead to
a reduction in useful information and degrade the prediction
capability. The selected miRNA subset should contain
miRNAs that have high relevance with the classes
and high significance in the miRNA set. Such miRNAs
are expected to be able to predict the classes of the samples.
Accordingly, a measure is required that can assess the
effectiveness of a miRNA set [9].

In microarray data, the class labels of samples are rep- 
resented by discrete symbols, while the expression values 
of miRNAs are continuous. Hence, to measure both rele- 
vance and significance of miRNAs using rough set theory, 
the continuous expression values of a miRNA have to be
divided into several discrete partitions to generate equivalence
classes [9]. However, the inherent error that exists
in the discretization process is of major concern in the computation
of the dependency of real valued features. The
rough hypercuboid approach of Wei et al. [36] is found to
be suitable for numerical data sets.

In this regard, this paper presents a new miRNA selection
method, termed μHEM. It employs the rough hypercuboid
approach to provide a means by which real valued
noisy data can be effectively reduced without the need for
user-specified information. The proposed method selects
a subset of miRNAs from the whole miRNA set by maximizing
both relevance and significance of the selected
miRNAs. Using the concept of hypercuboid equivalence
partition matrix, the degree of dependency is calculated
for miRNAs, which is used to compute both relevance
and significance of the miRNAs. Hence, the only information
required in the proposed method is in the form
of equivalence classes for each miRNA, which can be
automatically derived from the data set. The concept of
the so-called B.632+ error rate [37] is used to minimize the
variability and biasedness of the derived results. The support
vector machine is used to compute the B.632+ error
rate as well as several other types of error rates, as it
maximizes the margin between data samples in different
classes. The effectiveness of the proposed approach,
along with a comparison with other related approaches,
is demonstrated on several miRNA expression data
sets.

Methods 

Data sets used 

In the current research work, six publicly available
miRNA expression data sets with accession numbers
GSE17681, GSE17846, GSE21036, GSE24709, GSE28700,
and GSE31408 are used, which are downloaded from the
Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/).

GSE17681 

This data set has been generated to detect specific patterns
of miRNAs in peripheral blood samples of lung cancer
patients. As controls, blood samples of donors without known
affection have been tested. The number of miRNAs, samples,
and classes in this data set are 866, 36, and 2,
respectively [38].

GSE17846 

This data set represents the analysis of miRNA profiling
in peripheral blood samples of multiple sclerosis patients and in
the blood of normal donors. It contains 864 miRNAs, 41
samples, and 2 classes [39].

GSE21036 

This data set contains miRNA expression profiles of 218 
prostate tumors with primary or metastatic prostate can- 
cer with a median of 5 years clinical follow-up. The num- 
ber of miRNAs and samples are 373 and 141, respectively 
[40]. 

GSE24709 

It analyzes peripheral miRNA blood profiles of patients
with lung diseases. The miRNA expression profiling has
been done for patients with lung cancer, patients with chronic
obstructive pulmonary disease, and normal controls. It contains
a total of 863 miRNAs, 71 samples, and 3 classes.

GSE28700 

This data set contains expression profiles of miRNAs
from 22 paired gastric cancer and normal tissues. It contains
a total of 44 samples and 470 miRNAs. The samples are
grouped into 2 classes [41].

GSE31408 

It analyzes miRNA expression profiles of cutaneous T-cell
lymphomas and benign inflammation of skin. It consists
of a total of 705 miRNAs, 148 samples, and 2 classes [42].



Method 

Hypercuboid equivalence partition matrix 

Let $\mathbb{U} = \{x_1, \cdots, x_j, \cdots, x_n\}$ be the set of $n$ objects or samples
and $\mathbb{C} = \{A_1, \cdots, A_i, \cdots, A_j, \cdots, A_m\}$ denote the
set of $m$ attributes or miRNAs of a given microarray data
set $\mathcal{T} = \{w_{ij} \mid i = 1, \cdots, m,\ j = 1, \cdots, n\}$, where $w_{ij} \in \Re$
is the measured expression value of the miRNA $A_i$ in the
sample $x_j$. Let $\mathbb{D}$ be the set of class labels or sample categories
of $n$ samples. In rough set theory, the attribute sets
$\mathbb{C}$ and $\mathbb{D}$ are termed as the condition and decision attribute
sets in $\mathbb{U}$, respectively.

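If $\mathbb{U}/\mathbb{D} = \{\beta_1, \cdots, \beta_i, \cdots, \beta_c\}$ denotes $c$ equivalence
classes or information granules of $\mathbb{U}$ generated by the
equivalence relation induced from the decision attribute
set $\mathbb{D}$, then $c$ equivalence classes of $\mathbb{U}$ can also be generated
by the equivalence relation induced from each condition
attribute $A_k \in \mathbb{C}$. If $\mathbb{U}/A_k = \{\delta_1, \cdots, \delta_i, \cdots, \delta_c\}$ denotes $c$
equivalence classes or information granules of $\mathbb{U}$ induced
by the condition attribute $A_k$ and $n$ is the number of
objects in $\mathbb{U}$, then $c$-partitions of $\mathbb{U}$ are the sets of $(cn)$
values $\{h_{ij}(A_k)\}$ that can be conveniently arrayed as a
$(c \times n)$ matrix $\mathbb{H}(A_k) = [h_{ij}(A_k)]$. The matrix $\mathbb{H}(A_k)$ is
denoted by

$$\mathbb{H}(A_k) = \begin{pmatrix} h_{11}(A_k) & h_{12}(A_k) & \cdots & h_{1n}(A_k) \\ h_{21}(A_k) & h_{22}(A_k) & \cdots & h_{2n}(A_k) \\ \vdots & \vdots & \ddots & \vdots \\ h_{c1}(A_k) & h_{c2}(A_k) & \cdots & h_{cn}(A_k) \end{pmatrix} \tag{1}$$

$$\text{where } h_{ij}(A_k) = \begin{cases} 1 & \text{if } L_i \le x_j(A_k) \le U_i \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$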

The tuple $[L_i, U_i]$ represents the interval of the $i$th class
$\beta_i$ according to the decision attribute set $\mathbb{D}$. The interval
$[L_i, U_i]$ is the value range of condition attribute $A_k$ with
respect to class $\beta_i$. It is spanned by the objects with the same
class label $\beta_i$. That is, the value of each object $x_j$ with class
label $\beta_i$ falls within the interval $[L_i, U_i]$. This can be viewed
as a supervised granulation process, which utilizes class
information.
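As a minimal sketch of equations (1) and (2), the partition matrix of a single miRNA can be built by taking, for each class, the minimum and maximum expression value among its samples as $[L_i, U_i]$. The function name `partition_matrix`, the integer class coding 0, ..., c-1, and the toy expression values are assumptions made for illustration; they do not come from the authors' C implementation.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def partition_matrix(values: Sequence[float],
                     labels: Sequence[int],
                     n_classes: int) -> Matrix:
    """Return the c x n matrix H with H[i][j] = 1 iff L_i <= values[j] <= U_i.

    values[j] is the expression of one miRNA in sample j and labels[j] is
    the class of sample j, coded 0 .. n_classes-1 (an assumed encoding).
    Every class is assumed to contain at least one sample.
    """
    H: Matrix = []
    for i in range(n_classes):
        members = [v for v, y in zip(values, labels) if y == i]
        low, high = min(members), max(members)  # interval [L_i, U_i] of class i
        H.append([1 if low <= v <= high else 0 for v in values])
    return H


# Hypothetical toy data: one miRNA measured on six samples from two classes.
expr = [0.1, 0.4, 0.5, 0.45, 0.8, 0.9]
cls = [0, 0, 0, 1, 1, 1]
H = partition_matrix(expr, cls, 2)
for row in H:
    print(row)
```

In this toy case the two class intervals are [0.1, 0.5] and [0.45, 0.9], so the third and fourth samples fall into both intervals and receive a 1 in both rows.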

Generally, an m-dimensional hypercuboid or hyperrect- 
angle is defined in the m-dimensional Euclidean space, 
where the space is defined by the m variables measured 
for each sample or object. In geometry, a hypercuboid 
or hyperrectangle is the generalization of a rectangle 
for higher dimensions, formally defined as the Cartesian 
product of orthogonal intervals. A $d$-dimensional hypercuboid
with $d$ attributes as its dimensions is defined as the
Cartesian product of $d$ orthogonal intervals. It encloses
a region in the $d$-dimensional space, where each dimension
corresponds to a certain attribute. The value domain
of each dimension is the value range or interval that 
corresponds to a particular class. 

The $c \times n$ matrix $\mathbb{H}(A_k)$ is termed as the hypercuboid equivalence
partition matrix of the condition attribute $A_k$. It
represents the $c$-hypercuboid equivalence partitions of the
universe generated by an equivalence relation. Each row of
the matrix $\mathbb{H}(A_k)$ is a hypercuboid equivalence partition
or class. Here $h_{ij}(A_k) \in \{0, 1\}$ represents the membership
of object $x_j$ in the $i$th equivalence partition or class $\beta_i$
satisfying the following two conditions:

$$1 \le \sum_{j=1}^{n} h_{ij}(A_k) \le n, \ \forall i; \tag{3}$$

$$1 \le \sum_{i=1}^{c} h_{ij}(A_k) \le c, \ \forall j. \tag{4}$$

The above axioms should hold for every equivalence
partition, which corresponds to the requirement that
an equivalence class is non-empty. However, in real
data analysis, uncertainty arises due to overlapping class
boundaries. Hence, such a granulation process does not
necessarily result in a compatible granulation in the sense
that every two class hypercuboids or intervals may intersect
with each other. The intersection of two hypercuboids
also forms a hypercuboid, which is referred to as an implicit
hypercuboid. The implicit hypercuboids encompass the
misclassified samples or objects that belong to more
than one class. The degree of dependency of the decision
attribute set or class label on the condition attribute
set depends on the cardinality of the implicit hypercuboids.
The degree of dependency increases with the
decrease in cardinality. Hence, the degree of dependency
of the decision attribute on a condition attribute set is evaluated
by finding the implicit hypercuboids that encompass
misclassified objects. Using the concept of hypercuboid
equivalence partition matrix, the misclassified objects of
implicit hypercuboids can be identified based on the confusion
vector defined next

$$\mathbb{V}(A_k) = [v_1(A_k), \cdots, v_j(A_k), \cdots, v_n(A_k)] \tag{5}$$

$$\text{where } v_j(A_k) = \min\Big\{1, \sum_{i=1}^{c} h_{ij}(A_k) - 1\Big\}. \tag{6}$$
According to rough set theory, if an object $x_j$ belongs
to the lower approximation of any class $\beta_i$, then it does
not belong to the lower or upper approximations of any
other classes and $v_j(A_k) = 0$. On the other hand, if the
object $x_j$ belongs to the boundary region of more than
one class, then it should be encompassed by the implicit
hypercuboid and $v_j(A_k) = 1$. Hence, the hypercuboid
equivalence partition matrix and corresponding confusion
vector of the condition attribute $A_k$ can be used to
define the lower and upper approximations of the $i$th class
$\beta_i$ of the decision attribute set $\mathbb{D}$.

Let $\beta_i \subseteq \mathbb{U}$. $\beta_i$ can be approximated using only the information
contained within $A_k$ by constructing the A-lower
and A-upper approximations of $\beta_i$:

$$\underline{A}(\beta_i) = \{x_j \mid h_{ij}(A_k) = 1 \text{ and } v_j(A_k) = 0\}; \tag{7}$$

$$\overline{A}(\beta_i) = \{x_j \mid h_{ij}(A_k) = 1\}; \tag{8}$$

where equivalence relation A is induced from attribute $A_k$.
The boundary region of $\beta_i$ is then defined as

$$BN_A(\beta_i) = \{x_j \mid h_{ij}(A_k) = 1 \text{ and } v_j(A_k) = 1\}. \tag{9}$$

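The confusion vector of (5) and (6) and the approximations of (7) to (9) can be sketched as follows. This is an illustrative sketch only: the helper names `confusion_vector` and `approximations` are assumptions, samples are referred to by their column index, and the matrix `H` is a hypothetical two-class example in which two samples fall inside both class intervals.

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]


def confusion_vector(H: Matrix) -> List[int]:
    """v_j = min(1, sum_i h_ij - 1): 1 if sample j lies in more than one interval."""
    return [min(1, sum(column) - 1) for column in zip(*H)]


def approximations(H: Matrix, i: int) -> Tuple[Set[int], Set[int], Set[int]]:
    """Return (lower, upper, boundary) index sets of class i, as in (7)-(9)."""
    v = confusion_vector(H)
    upper = {j for j, h in enumerate(H[i]) if h == 1}
    lower = {j for j in upper if v[j] == 0}
    boundary = {j for j in upper if v[j] == 1}
    return lower, upper, boundary


# Hypothetical partition matrix: 2 classes, 6 samples; samples 2 and 3 overlap.
H = [[1, 1, 1, 1, 0, 0],
     [0, 0, 1, 1, 1, 1]]
v = confusion_vector(H)
lower0, upper0, boundary0 = approximations(H, 0)
print(v, lower0, upper0, boundary0)
```

By construction the boundary region is the upper approximation minus the lower approximation, which matches the usual rough set definition.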
Dependency 

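Combining (1), (5), and (7), the dependency between condition
attribute $A_k$ and decision attribute $\mathbb{D}$ can be defined
as follows:

$$\gamma_{A_k}(\mathbb{D}) = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} h_{ij}(A_k) \cap [1 - v_j(A_k)], \tag{10}$$

$$\text{that is, } \gamma_{A_k}(\mathbb{D}) = 1 - \frac{1}{n} \sum_{j=1}^{n} v_j(A_k), \tag{11}$$

where $0 \le \gamma_{A_k}(\mathbb{D}) \le 1$. If $\gamma_{A_k}(\mathbb{D}) = 1$, $\mathbb{D}$ depends totally
on $A_k$; if $0 < \gamma_{A_k}(\mathbb{D}) < 1$, $\mathbb{D}$ depends partially on $A_k$; and
if $\gamma_{A_k}(\mathbb{D}) = 0$, then $\mathbb{D}$ does not depend on $A_k$. The $\gamma_{A_k}(\mathbb{D})$
is also termed as the relevance of attribute $A_k$ with respect
to class $\mathbb{D}$.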
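To make the link between the double-sum form and the simplified form of the dependency concrete, the sketch below evaluates both on a hypothetical partition matrix. The function names are assumptions, and the intersection of binary values is implemented as a product; the two forms agree whenever every sample lies inside the interval of at least one class, which condition (4) guarantees.

```python
from typing import List

Matrix = List[List[int]]


def confusion_vector(H: Matrix) -> List[int]:
    """v_j = min(1, sum_i h_ij - 1), one entry per sample (column of H)."""
    return [min(1, sum(column) - 1) for column in zip(*H)]


def dependency(H: Matrix) -> float:
    """Simplified form: gamma = 1 - (1/n) * sum_j v_j."""
    n = len(H[0])
    return 1.0 - sum(confusion_vector(H)) / n


def dependency_double_sum(H: Matrix) -> float:
    """Double-sum form: (1/n) * sum_i sum_j h_ij * (1 - v_j)."""
    n = len(H[0])
    v = confusion_vector(H)
    return sum(h * (1 - v_j) for row in H for h, v_j in zip(row, v)) / n


# Hypothetical example: 2 of the 6 samples lie in both class intervals.
H = [[1, 1, 1, 1, 0, 0],
     [0, 0, 1, 1, 1, 1]]
gamma = dependency(H)
gamma_check = dependency_double_sum(H)
print(gamma, gamma_check)  # both equal 4/6 up to rounding
```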

Significance 

Given two condition attributes $A_k$ and $A_l$, the $c \times n$ hypercuboid
equivalence partition matrix corresponding to the
set $\mathbb{A} = \{A_k, A_l\}$ can be calculated from the two $c \times n$ hypercuboid
equivalence partition matrices $\mathbb{H}(A_k)$ and $\mathbb{H}(A_l)$
as follows:

$$\mathbb{H}(\{A_k, A_l\}) = \mathbb{H}(A_k) \cap \mathbb{H}(A_l); \tag{12}$$

$$\text{where } h_{ij}(\{A_k, A_l\}) = h_{ij}(A_k) \cap h_{ij}(A_l). \tag{13}$$

The change in dependency when an attribute is removed
from the set of condition attributes is a measure of the
significance of the attribute. The significance of an attribute
thus quantifies the extent to which it contributes to the
dependency on the decision
attribute. The significance of the attribute $A_k$ with respect
to the condition attribute set $\mathbb{A} = \{A_k, A_l\}$ is given by

$$\sigma_{\{A_k, A_l\}}(\mathbb{D}, A_k) = \frac{1}{n} \sum_{j=1}^{n} \left[ v_j(\mathbb{A} - \{A_k\}) - v_j(\mathbb{A}) \right]; \tag{14}$$

where $0 \le \sigma_{\{A_k, A_l\}}(\mathbb{D}, A_k) \le 1$. Hence, the higher the
change in dependency, the more significant the attribute
$A_k$ is. If significance is 0, then the attribute is dispensable.
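A short sketch of (12) to (14) under the same conventions: the pairwise matrix is the element-wise AND of the two single-attribute matrices, and the significance of $A_k$ is the drop in the fraction of misclassified samples when $A_k$ is added to $A_l$. The helper names and the two example matrices are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def intersect(H_a: Matrix, H_b: Matrix) -> Matrix:
    """Element-wise AND of two c x n partition matrices, as in (12)-(13)."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(H_a, H_b)]


def misclassified(H: Matrix) -> int:
    """Number of samples with confusion value 1 (inside an implicit hypercuboid)."""
    return sum(min(1, sum(column) - 1) for column in zip(*H))


def significance(H_k: Matrix, H_l: Matrix) -> float:
    """Significance of A_k w.r.t. {A_k, A_l}: (1/n) * sum_j [v_j({A_l}) - v_j({A_k, A_l})]."""
    n = len(H_k[0])
    return (misclassified(H_l) - misclassified(intersect(H_k, H_l))) / n


# Hypothetical matrices: each attribute alone confuses two samples,
# but different ones, so together they separate the classes.
H_k = [[1, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1]]
H_l = [[1, 1, 1, 0, 1, 0],
       [0, 1, 0, 1, 1, 1]]
sig_k = significance(H_k, H_l)
sig_l = significance(H_l, H_k)
print(sig_k, sig_l)
```

Here each attribute removes the two misclassifications left by the other, so both significance values equal 2/6. An attribute paired with a copy of itself would have significance 0 and would be dispensable.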

μHEM: proposed miRNA selection method

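Let $\gamma_{A_i}(\mathbb{D})$ be the relevance of the miRNA $A_i$ with respect
to the class labels $\mathbb{D}$ and $\sigma_{\{A_i, A_j\}}(\mathbb{D}, A_i)$ be the significance
of the miRNA $A_i$ with respect to another miRNA $A_j \in
\mathbb{S}$, where $\mathbb{S}$ is the set of selected miRNAs. The average
relevance of all selected miRNAs is, therefore, given by

$$\mathcal{J}_{\mathrm{relev}} = \frac{1}{|\mathbb{S}|} \sum_{A_i \in \mathbb{S}} \gamma_{A_i}(\mathbb{D}), \tag{15}$$

while the average significance among the selected
miRNAs is as follows

$$\mathcal{J}_{\mathrm{signf}} = \frac{1}{|\mathbb{S}|(|\mathbb{S}| - 1)} \sum_{A_i \neq A_j \in \mathbb{S}} \sigma_{\{A_i, A_j\}}(\mathbb{D}, A_i). \tag{16}$$

Therefore, the problem of selecting a set $\mathbb{S}$ of relevant
and significant miRNAs from the whole miRNA set $\mathbb{C}$
is equivalent to maximizing $\mathcal{J}_{\mathrm{relev}}$ and $\mathcal{J}_{\mathrm{signf}}$, that is, to
maximizing the objective function $\mathcal{J}$, where

$$\mathcal{J} = \omega \mathcal{J}_{\mathrm{relev}} + (1 - \omega) \mathcal{J}_{\mathrm{signf}} \tag{17}$$

and $\omega$ is a weight parameter. To solve the above problem,
the following greedy algorithm is used.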

1. Initialize $\mathbb{C} \leftarrow \{A_1, \cdots, A_i, \cdots, A_m\}$, $\mathbb{S} \leftarrow \emptyset$.

2. Generate hypercuboid equivalence partition matrix
$\mathbb{H}(A_i)$ and corresponding confusion vector $\mathbb{V}(A_i)$ for
each miRNA $A_i \in \mathbb{C}$ using (1) and (5), respectively.

3. Calculate the relevance $\gamma_{A_i}(\mathbb{D})$ of each miRNA
$A_i \in \mathbb{C}$ using (11).

4. Select the miRNA $A_i$ as the most relevant miRNA
that has the highest relevance value $\gamma_{A_i}(\mathbb{D})$. In effect,
$A_i \in \mathbb{S}$ and $\mathbb{C} = \mathbb{C} \setminus A_i$.

5. Repeat the following two steps until $\mathbb{C} = \emptyset$ or the
desired number of miRNAs is selected.

6. Repeat the following four steps for each of the
remaining miRNAs of $\mathbb{C}$.

(a) Generate hypercuboid equivalence partition
matrix $\mathbb{H}(\{A_i, A_j\})$ using (12) between each
selected miRNA $A_i \in \mathbb{S}$ and each miRNA
$A_j \in \mathbb{C}$.

(b) Generate corresponding confusion vector
$\mathbb{V}(\{A_i, A_j\})$ for two miRNAs $A_i$ and $A_j$ using
(5).

(c) Calculate the significance of each miRNA
$A_j \in \mathbb{C}$ with respect to each of the already
selected miRNAs of $\mathbb{S}$ using (14).

(d) Remove $A_j$ from $\mathbb{C}$ if it has zero significance
value with respect to any one of the selected
miRNAs. In effect, $\mathbb{C} = \mathbb{C} \setminus A_j$.

7. From the remaining miRNAs of $\mathbb{C}$, select miRNA $A_j$
that maximizes the following condition:

$$\omega \gamma_{A_j}(\mathbb{D}) + \frac{1 - \omega}{|\mathbb{S}|} \sum_{A_i \in \mathbb{S}} \sigma_{\{A_i, A_j\}}(\mathbb{D}, A_j). \tag{18}$$

As a result of that, $A_j \in \mathbb{S}$ and $\mathbb{C} = \mathbb{C} \setminus A_j$.

8. Stop.
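The greedy steps above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the authors' C implementation: the function name `mu_hem_sketch`, the row-per-miRNA data layout, the integer class coding, the tie-breaking rule (first maximum wins), and the toy data are all hypothetical. The score of a candidate is taken as $\omega$ times its relevance plus $(1 - \omega)$ times its average significance with respect to the already selected miRNAs.

```python
from typing import Dict, List, Sequence

Matrix = List[List[int]]


def partition_matrix(values: Sequence[float], labels: Sequence[int], c: int) -> Matrix:
    """c x n matrix with 1 where a sample lies in the value range of class i."""
    H: Matrix = []
    for i in range(c):
        members = [v for v, y in zip(values, labels) if y == i]
        low, high = min(members), max(members)
        H.append([1 if low <= v <= high else 0 for v in values])
    return H


def intersect(H_a: Matrix, H_b: Matrix) -> Matrix:
    """Element-wise AND of two partition matrices."""
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(H_a, H_b)]


def misclassified(H: Matrix) -> int:
    """Count of samples whose confusion value is 1."""
    return sum(min(1, sum(col) - 1) for col in zip(*H))


def mu_hem_sketch(data: Sequence[Sequence[float]], labels: Sequence[int],
                  c: int, d: int, omega: float) -> List[int]:
    """Return indices of up to d selected miRNAs (rows of data)."""
    n = len(labels)
    H = [partition_matrix(row, labels, c) for row in data]      # Step 2
    miss = [misclassified(h) for h in H]
    relevance = [1.0 - m / n for m in miss]                      # Step 3
    candidates = list(range(len(data)))
    best = max(candidates, key=lambda k: relevance[k])           # Step 4
    selected = [best]
    candidates.remove(best)
    while candidates and len(selected) < d:                      # Step 5
        scores: Dict[int, float] = {}
        for j in candidates:                                     # Step 6
            # Integer drop in misclassified samples when j joins each selected i.
            gains = [miss[i] - misclassified(intersect(H[i], H[j])) for i in selected]
            if min(gains) == 0:
                continue  # Step 6(d): zero significance w.r.t. a selected miRNA
            avg_sig = sum(gains) / (n * len(selected))
            scores[j] = omega * relevance[j] + (1.0 - omega) * avg_sig
        if not scores:
            break
        best = max(scores, key=lambda k: scores[k])              # Step 7
        selected.append(best)
        candidates = [j for j in scores if j != best]
    return selected


# Hypothetical toy data: 3 miRNAs (rows) x 6 samples, two classes.
data = [[0.1, 0.4, 0.5, 0.45, 0.8, 0.9],
        [0.2, 0.6, 0.3, 0.7, 0.5, 0.9],
        [0.5, 0.1, 0.9, 0.2, 0.8, 0.4]]
labels = [0, 0, 0, 1, 1, 1]
chosen = mu_hem_sketch(data, labels, c=2, d=2, omega=0.5)
print(chosen)
```

The sketch compares integer counts of misclassified samples rather than floating-point dependencies when testing for zero significance, which avoids rounding issues in Step 6(d).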



Computational complexity 

The proposed μHEM method has low computational
complexity with respect to the number of miRNAs, samples,
and classes. Prior to computing the relevance or
significance of a miRNA, the hypercuboid equivalence
partition matrix and confusion vector for each miRNA
are to be generated first, which is carried out in Step 2
of the proposed algorithm. The computational complexity
to generate a $(c \times n)$ hypercuboid equivalence partition
matrix is $O(cn)$, where $c$ and $n$ represent the number of
classes and objects in the data set, respectively, while the
generation of the confusion vector also has $O(cn)$ time complexity.
In effect, the computation of the relevance of a
miRNA has $O(cn)$ time complexity. Hence, the total complexity
to compute the relevance of $m$ miRNAs, which
is carried out in Step 3 of the proposed algorithm, is
$O(mcn)$. The selection of the most relevant miRNA from the
set of $m$ miRNAs, which is carried out in Step 4, has a
complexity $O(m)$.

There is only one loop in Step 5 of the proposed miRNA
selection method, which is executed $(d - 1)$ times, where
$d$ represents the number of selected miRNAs. The computation
of the significance of a candidate miRNA
with respect to another miRNA also has the complexity
$O(cn)$. If $\tilde{m}$ represents the cardinality of the already
selected miRNA set, the total complexity to compute the
significance of $(m - \tilde{m})$ candidate miRNAs, which is carried
out in Step 6, is $O((m - \tilde{m})cn)$. The selection of a
miRNA from $(m - \tilde{m})$ candidate miRNAs by maximizing
relevance and significance, which is carried out in Step 7,
has a complexity $O(m - \tilde{m})$. Hence, the total complexity
to execute the loop $(d - 1)$ times is $O((d - 1)((m - \tilde{m}) +
(m - \tilde{m})cn)) = O(dcn(m - \tilde{m}))$.

In effect, the selection of a set of $d$ relevant and significant
miRNAs from the whole set of $m$ miRNAs using
the proposed hypercuboid equivalence partition matrix
based first order incremental search method has an overall
computational complexity of $O(mcn) + O(m) + O(dcn(m - \tilde{m})) = O(dnm)$
as $c, \tilde{m} \ll m$.

B.632+ error rate

In order to minimize the variability and biasedness of
the derived results, the so-called B.632+ bootstrap approach
[37] is used, which is defined as follows:

$$B.632+ = (1 - \hat{\omega})\,AE + \hat{\omega}\,B1 \tag{19}$$

where AE denotes the proportion of the original training
samples misclassified, termed as apparent error rate, and
B1 is the bootstrap error, defined as follows:

$$B1 = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{\sum_{k=1}^{M} I_{jk} Q_{jk}}{\sum_{k=1}^{M} I_{jk}} \right) \tag{20}$$

where $n$ is the number of original samples and $M$ is the
number of bootstrap samples. If the sample $x_j$ is not contained
in the $k$th bootstrap sample, then $I_{jk} = 1$, otherwise
0. Similarly, if $x_j$ is misclassified, $Q_{jk} = 1$, otherwise 0. The
weight parameter $\hat{\omega}$ is given by

$$\hat{\omega} = \frac{0.632}{1 - 0.368\,r}, \tag{21}$$

$$\text{where } r = \frac{B1 - AE}{\gamma - AE}, \tag{22}$$

$$\text{and } \gamma = \sum_{i=1}^{c} p_i (1 - q_i); \tag{23}$$

where $c$ is the number of classes, $p_i$ is the proportion of
the samples from the $i$th class, and $q_i$ is the proportion
of them assigned to the $i$th class. Also, $\gamma$ is termed as the
no-information error rate that would apply if the distribution
of the class-membership label of the sample $x_j$ did not
depend on its feature vector.
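The estimator of (19) to (23) can be illustrated with a small sketch that works directly on the indicator values $I_{jk}$ and $Q_{jk}$. The function names and the toy indicator matrices are assumptions for illustration. Two simplifications are also assumed: samples that are never left out of any bootstrap sample are skipped when averaging, and no clipping of $r$ is applied, so the function expects $AE \le B1 \le \gamma$ with $\gamma \ne AE$.

```python
from typing import List, Sequence


def bootstrap_b1(I: Sequence[Sequence[int]], Q: Sequence[Sequence[int]]) -> float:
    """Leave-one-out bootstrap error from indicators I[j][k] and Q[j][k].

    I[j][k] = 1 if sample j is absent from bootstrap sample k;
    Q[j][k] = 1 if sample j is misclassified by the rule trained on sample k.
    Samples never left out are skipped (an assumption of this sketch).
    """
    per_sample: List[float] = []
    for I_j, Q_j in zip(I, Q):
        left_out = sum(I_j)
        if left_out == 0:
            continue
        errors = sum(i * q for i, q in zip(I_j, Q_j))
        per_sample.append(errors / left_out)
    return sum(per_sample) / len(per_sample)


def no_information_rate(p: Sequence[float], q: Sequence[float]) -> float:
    """gamma = sum_i p_i * (1 - q_i)."""
    return sum(p_i * (1.0 - q_i) for p_i, q_i in zip(p, q))


def b632_plus(ae: float, b1: float, gamma: float) -> float:
    """(1 - w) * AE + w * B1 with w = 0.632 / (1 - 0.368 * r)."""
    r = (b1 - ae) / (gamma - ae)   # relative overfitting rate
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * ae + w * b1


# Hypothetical indicators for n = 3 samples and M = 3 bootstrap samples.
I = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
Q = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]
b1 = bootstrap_b1(I, Q)
gamma = no_information_rate([0.5, 0.5], [0.5, 0.5])
err = b632_plus(0.0, b1, gamma)
print(b1, gamma, err)
```

When $r = 0$ the weight equals 0.632, and when $r = 1$ it equals 1, so the estimate moves from a fixed blend of AE and B1 toward B1 as overfitting increases.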

Support vector machine 

In the current study, the support vector machine (SVM) 
[43] is used to evaluate the performance of the proposed 
μHEM algorithm as well as several other feature selection
algorithms. The SVM is a margin classifier that draws 
an optimal hyperplane in the feature vector space; this 
defines a boundary that maximizes the margin between 
data samples in different classes, therefore leading to 
good generalization properties. A key factor in the SVM is
the use of kernels to construct a nonlinear decision boundary.
In the present work, linear kernels are used. The source 
code of the SVM has been downloaded from Library for 
Support Vector Machines (www.csie.ntu.edu.tw/~cjlin/ 
libsvm/). 

To compute different types of error rates obtained
using the SVM, the bootstrap approach is performed on each
miRNA expression data set. For each training set, a set of
differential miRNAs is first generated, and then the SVM
is trained with the selected miRNAs. After the training,
the information of the miRNAs that were selected for the
training set is used to generate the test set, and then the class
label of each test sample is predicted using the SVM. For
each data set, fifty top-ranked miRNAs are selected for the
analysis.

In order to calculate the B.632+ error rate, the apparent
error (AE) is first calculated. This error is obtained when
the same original data set is used to train and test a classifier.
After that, the B1 error is computed from M bootstrap
samples. Finally, the no-information error ($\gamma$) is calculated
by randomly perturbing the class labels of a given data set.
The mutated data set is used for miRNA selection and
the selected miRNA set is used to build the SVM. Then,
the trained SVM is used to classify the original data set.
The error generated by this procedure is known as the $\gamma$ rate.
Finally, the B.632+ error rate is computed based on the
AE, B1 error, and $\gamma$ error using (19).

Results and discussions 

The performance of the proposed hypercuboid equivalence
partition matrix based miRNA selection (μHEM)
method is extensively studied and compared with that
of some existing feature selection algorithms. The algorithms
compared are the mutual information based InfoGain
[44] and minimum redundancy-maximum relevance
(mRMR) [45] algorithms, the method proposed by Golub et al.
[46], the rough set based maximum relevance-maximum significance
(RSMRMS) algorithm [9,28], boosting [47], and
lasso [48]. The source code of the proposed μHEM algorithm,
written in C language, is available at
www.isical.ac.in/~bibl/results/mihem/mihem.html. All the algorithms
are run on Ubuntu 12.04 LTS on a machine with an
Intel Core i7-2600 CPU @ 3.40GHz x 8 and 16 GB
RAM.

Performance analysis of μHEM algorithm

This section presents the performance of the proposed
μHEM algorithm on six miRNA data sets with respect to
the B.632+ error rate of the SVM.

Optimum value of weight parameter $\omega$

The weight parameter $\omega$ in (18) regulates the relative
importance of the significance of the candidate
miRNA with respect to the already selected miRNAs
and the relevance with the output class. If $\omega$ is one,
only the relevance with the output class is considered
for each miRNA selection. The presence of an $\omega$
value lower than one is crucial in order to obtain
good results. If the significance between miRNAs is
not taken into account, selecting the miRNAs with
the highest relevance with respect to the output class
may tend to produce a set of redundant and insignificant
miRNAs that may leave out useful complementary
information. On the other hand, if $\omega$ is zero, the
miRNAs are selected based on their significance values
only without considering the relevance of each
miRNA. In effect, the selected miRNA set may contain
a number of irrelevant miRNAs. Hence, the value
of weight parameter $\omega$ should be in between zero and
one in order to obtain good results, that is, $0 <
\omega < 1$.

To find out the optimum value of $\omega$ for each miRNA
data set, the coefficient of variation ($C_v$) of average significance
value is used. It is a measure of relative dispersion
and is defined as the quotient between standard deviation and
mean value. Let the average significance value of the $j$th
selected miRNA $A_j$ with respect to the already selected
miRNA set $\mathbb{S}_{j-1}$, for a given $\omega$ value, be

$$\bar{\sigma}_j(\omega) = \frac{1}{|\mathbb{S}_{j-1}|} \sum_{A_i \in \mathbb{S}_{j-1}} \sigma_{\{A_i, A_j\}}(\mathbb{D}, A_j) \tag{24}$$

where $\mathbb{D}$ represents the set of class labels of the samples
and $\mathbb{S}_j = \mathbb{S}_{j-1} \cup \{A_j\}$. If $\mu(\omega)$ and $s(\omega)$ represent the mean
and standard deviation of the average significance values
of $d$ selected miRNAs for a particular value of $\omega$, then the
$C_v$ index is defined as follows:

$$C_v(\omega) = \frac{s(\omega)}{\mu(\omega)} \tag{25}$$

where mean and standard deviation for $d$ selected
miRNAs are computed as follows:

$$\mu(\omega) = \frac{1}{d} \sum_{i=1}^{d} \bar{\sigma}_i(\omega); \tag{26}$$

$$s(\omega) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} \left[ \bar{\sigma}_i(\omega) - \mu(\omega) \right]^2}. \tag{27}$$

The lower value of the $C_v$ index, that is, the higher value
of mean $\mu$ and lower value of standard deviation $s$, ensures
that the average significance of the set of selected miRNAs
is higher. A good miRNA selection method should make
the value of $C_v$ index as low as possible.

To find out the optimum value of $\omega$, extensive experimentation
is carried out on six miRNA expression data
sets. The value of $\omega$ is varied from 0.0 to 1.0. In the current
study, $d = 30$ and $d = 50$ top-ranked miRNAs are
selected for analysis. Figure 1 presents the variation of the
$C_v$ index obtained using the proposed μHEM algorithm
for different values of $\omega$ on six miRNA data sets. From the
results reported in Figure 1, it is seen that as the value of
weight parameter $\omega$ increases, the $C_v$ index decreases and
attains its minimum value at a particular value of $\omega = \omega^{\star}$.
After that the $C_v$ index value increases with the increase
in the value of $\omega$. Hence, the optimum value of $\omega$ for each
data set is obtained using the following relation:

$$\omega^{\star} = \arg\min_{\omega} \{C_v(\omega)\}. \tag{28}$$

The optimum values of $\omega$ obtained using (28) are
0.1 for GSE17681, GSE17846, GSE21036, GSE24709, and
GSE28700, and 0.4 for GSE31408, irrespective of the number
of selected miRNAs.

Figures 2 and 3 present the variation of the B.632+ error
rate obtained using the proposed μHEM algorithm for
different values of $\omega$ on GSE17681, GSE17846, GSE21036,
and GSE24709 data sets as examples considering $d = 50$.
From the results reported in Figures 2 and 3, it is seen
that the B.632+ error rate of the SVM decreases with
the increase in the number of selected miRNAs, irrespective
of the value of $\omega$. Also, the error rate is lower for
$0.0 < \omega < 0.5$ than for both $\omega = 0.0$ and 1.0. Similar results
can also be seen for both GSE28700 and GSE31408 data
sets.

Finally, Table 1 presents the minimal B.632+ error rate
of the SVM for different values of weight parameter $\omega$,
along with the value of the $C_v$ index. For each miRNA data
set, the minimum B.632+ error rate is written in italic,
while the best $C_v$ index is marked in bold. From the results
reported in Table 1, it is seen that the proposed μHEM
algorithm achieves its best performance at $\omega = \omega^{\star}$ in five
cases out of the six miRNA data sets. Only for the GSE28700
data set, the B.632+ error rate at $\omega = \omega^{\star}$ is higher than
that of both $\omega = 0.0$ and 1.0. The lowest B.632+ error
rate is achieved at $\omega = 1.0$ for this data set. All the results
reported in Figures 1, 2, and 3, and Table 1 establish the
importance of both relevance and significance criteria in
the proposed μHEM method for selecting differentially
expressed miRNAs from microarray data.

Figure 1 Variation of the $C_v$ index for different values of weight parameter $\omega$.

Figure 2 Variation of B.632+ error rate on GSE17681 and GSE17846 data sets for different values of weight parameter $\omega \in [0.0, 1.0]$, averaged over 50 random splits.



Optimum number of selected miRNAs 

According to Lu et al. [1], unlike with mRNAs, a modest
number of miRNAs might be sufficient to classify human
cancers. Also, the number of training samples is typically
very small compared to the number of miRNAs. Hence,
the use of a large number of miRNAs in constructing a
classifier may degrade the prediction capability on test samples
[10].

In order to find the optimum number of selected
miRNAs, extensive experimentation is carried out on the six
microarray data sets. Figure 4 depicts the relevance and
average significance values of each of the selected miRNAs
for the six expression data sets. The results are presented for
the optimum values of ω, considering 100 selected miRNAs.
From the results reported in Figure 4, it can be seen that as
the number of selected miRNAs increases, both relevance
and significance values decrease. Also, the significance
value remains constant after selecting forty to forty-five
miRNAs, irrespective of the data set used. Hence, in the
current study, the number of selected miRNAs is set to
d = 50.
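
The plateau argument above can be illustrated with a small check that locates the point after which a sequence of significance values stops changing. This is only a sketch under stated assumptions: the function `plateau_start`, the tolerance, and the synthetic values are hypothetical and are not the data behind Figure 4.

```python
def plateau_start(values, tol=1e-3):
    """Return the smallest 1-based position k such that every value from
    position k onward lies within tol of the final value."""
    last = values[-1]
    k = len(values)
    # Walk backwards while the preceding value stays within tol of the last one.
    while k > 1 and abs(values[k - 2] - last) <= tol:
        k -= 1
    return k


# Synthetic, made-up curve: decreasing for 42 features, then flat.
significance = [0.5 - 0.01 * i for i in range(42)] + [0.08] * 58

k = plateau_start(significance)
print("values are constant from feature", k)
```

Any cut-off d at or beyond the reported position would then retain the whole decreasing part of the curve, which is the reasoning behind choosing d = 50 once the plateau begins at forty to forty-five miRNAs.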

Error rate and execution time 

Figure 5 presents the variation of several error rates
obtained using the proposed μHEM algorithm for different
numbers of samples. The data sets on the x-axis of Figure 5
are arranged in ascending order of the number of samples
present in each data set; that is, the numbers of samples in the
GSE17681, GSE17846, GSE28700, GSE24709, GSE21036,
and GSE31408 data sets are 36, 41, 44, 71, 141, and 148,
respectively.

From all the results reported in Figure 5, it is seen that
different error rates such as AE, B1, and B.632+ do not
depend on the number of samples present in the data set;
rather, they depend on the distribution of the samples in
different classes or categories. For example, although the
numbers of samples in the GSE17846 and GSE28700 data sets
are almost equal, that is, 41 and 44, respectively, there is
a significant difference in errors for these two data sets:
the B.632+ errors for the GSE17846 and GSE28700 data sets
are 0.059 and 0.257, respectively. On the other hand, the
B.632+ errors for the GSE17846 data set with 41 samples and
the GSE31408 data set with 148 samples are 0.059 and 0.067,
respectively.

Figure 3 Variation of B.632+ error rate on GSE21036 and GSE24709 data sets for different values of weight parameter ω ∈ [0.0, 1.0]
averaged over 50 random splits.

Table 1 Performance of μHEM algorithm on six miRNA data sets for different values of ω

Value   GSE17681         GSE17846         GSE21036         GSE24709         GSE28700         GSE31408
of ω    B.632+  Cv       B.632+  Cv       B.632+  Cv       B.632+  Cv       B.632+  Cv       B.632+  Cv
0.0     0.0854  0.4951   0.0605  0.4275   0.0403  0.6528   0.1863  0.2312   0.2498  0.3388   0.0757  0.4688
0.1     0.0842  0.4421   0.0590  0.4042   0.0388  0.5956   0.1803  0.2171   0.2566  0.2693   0.0753  0.4275
0.2     0.0870  0.4502   0.0623  0.4094   0.0396  0.6124   0.1898  0.2213   0.2660  0.2752   0.0742  0.4368
0.3     0.0851  0.4542   0.0644  0.4148   0.0410  0.6246   0.1878  0.2256   0.2572  0.2818   0.0732  0.4543
0.4     0.0894  0.4611   0.0627  0.4206   0.0420  0.6319   0.1881  0.2312   0.2583  0.2889   0.0672  0.4190
0.5     0.0882  0.4680   0.0640  0.4275   0.0394  0.6384   0.1970  0.2399   0.2587  0.2980   0.0690  0.5097
0.6     0.0882  0.4951   0.0651  0.4319   0.0392  0.6447   0.1940  0.2429   0.2571  0.3079   0.0693  0.5508
0.7     0.0893  0.5105   0.0637  0.4337   0.0402  0.6493   0.1951  0.2536   0.2632  0.3241   0.0683  0.5826
0.8     0.0893  0.5202   0.0636  0.4366   0.0405  0.6528   0.1992  0.2564   0.2649  0.3388   0.0690  0.6088
0.9     0.0893  0.5202   0.0636  0.4380   0.0398  0.6664   0.2002  0.2564   0.2650  0.3388   0.0697  0.6414
1.0     0.0860  0.5958   0.0724  0.4575   0.0410  0.6801   0.2095  0.2950   0.2475  0.4191   0.0693  0.6771

Figure 6 reports the execution time of the proposed
μHEM algorithm for different numbers of selected
miRNAs. Results are presented for all six miRNA data sets
by varying the number of selected miRNAs from 10 to
100. From the results reported in Figure 6, it can be
seen that the execution time of the proposed algorithm
grows in proportion to the number of selected miRNAs,
the total number of miRNAs, and the number of samples.

Figure 4 Relevance and significance values of each of the selected miRNAs for different miRNA data sets.

Figure 5 Variation of several error rates obtained using μHEM
algorithm for different number of samples.

Figure 6 Execution time of μHEM algorithm on six data sets for
different number of selected miRNAs.

Importance of B.632+ error rate

This section establishes the importance of using the B.632+
error rate over other types of errors such as the apparent
error (AE), the no-information error rate (γ), and the bootstrap
error (B1). Different types of errors on each miRNA
expression data set are calculated using the SVM for the
proposed method. All the results are presented for the
optimum values of ω, considering d = 50. Figures 7
and 8 present the various types of errors obtained by the
proposed algorithm on the GSE17681, GSE17846, GSE21036,
and GSE24709 data sets as examples. From Figures 7
and 8, it is seen that the different types of errors decrease
as the number of selected miRNAs increases. Similar
results are also found for the GSE28700 and GSE31408
data sets. For all six data sets, the AE consistently attains
the lowest value, while γ has the highest value. The B1
error is smaller than γ but higher than the AE. Moreover,
the B.632+ estimate is smaller than the B1 but higher than
the AE.

Figure 7 Different error rates of the proposed algorithm on GSE17681 and GSE17846 data sets obtained using the SVM averaged over 50
random splits.

Figure 8 Different error rates of the proposed algorithm on GSE21036 and GSE24709 data sets obtained using the SVM averaged over 50
random splits.

Table 2 reports the minimum values of different errors,
along with the number of miRNAs required to attain these
values. From all the results reported in this table, it can be
seen that the B.632+ estimator corrects the upward bias
of B1 and the downward bias of AE. Also, it puts more weight
on B1 in situations where the amount of overfitting, as
measured by (B1 - AE), is relatively large. It is thus applicable
in the present context, where the prediction rule generated
by the SVM may be overfitted.

Comparative performance analysis

This section compares the performance of the proposed
μHEM algorithm with that of InfoGain [44], the mRMR
algorithm [45], the method proposed by Golub et al. [46],
the RSMRMS algorithm [9], boosting [47], and lasso [48].
Table 3 and Figures 9, 10, 11, 12, 13, and 14 present
different error rates obtained by various feature selection
algorithms on the six miRNA expression data sets.

AE and B1 error

Table 3 compares the best performance of different feature
selection algorithms based on the error rate of the SVM.
From the results reported in Table 3, it is seen that
the best AE for each miRNA data set is the same for most
of the algorithms. Both the proposed μHEM algorithm and
the mRMR method attain the best AE value for all data sets,
while the method proposed by Golub et al. and InfoGain
achieve it for five data sets, and the boosting and RSMRMS
methods attain this value on two data sets. However, the
μHEM achieves the best AE value with a lower number of
selected miRNAs than the other methods
on the GSE17681, GSE17846, and GSE24709 data sets, while
the mRMR method does so for the GSE21036 and GSE28700
data sets and the method proposed by Golub et al. for the
GSE31408 data set. On the other hand, the boosting
method attains the lowest B1 error rate in four cases out of
the six data sets, while the μHEM method and lasso
achieve it only for the GSE21036 and GSE31408 data sets,
respectively.

Gap estimate

However, according to Efron and Tibshirani [37], the
bootstrap approach (B1) overestimates the error. In this
regard, the Gap function [49] is generally used to determine
whether the obtained B1 error is smaller than would
be expected by chance, if the distribution of the
class-membership label of a sample did not depend on its
feature vector. The Gap function represents the difference
between the no-information error (γ) and the bootstrap error
(B1), and is defined by

\[ \mathrm{Gap} = \gamma - B1. \qquad (29) \]

A larger value of the Gap function indicates that the
observed B1 error is significantly lower than
that expected by chance. Figures 9, 10, and 11 depict
the gap curves, which highlight the difference between
the γ and B1 errors obtained using different algorithms on
the six miRNA data sets. From the results reported in these
figures, it can be found that the Gap estimate increases
with the number of selected miRNAs,
irrespective of the algorithms and data sets used. Also,
the Gap function always achieves significantly higher
values for the proposed μHEM algorithm, while for both
boosting and lasso, the Gap estimate is very low. Table 3
compares the best values of the Gap function obtained
using different algorithms. All the results reported here
confirm that the proposed algorithm attains the highest
values of the Gap function in five cases, while the method
proposed by Golub et al. achieves it only for the GSE31408 data
set.
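
The Gap function in (29) needs the no-information error rate γ. A minimal Python sketch follows, assuming the estimate of γ used by Efron and Tibshirani [37], namely the sum over classes of p_l (1 - q_l), where p_l is the observed proportion of class l and q_l is the proportion of samples predicted as class l. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def no_information_error(true_labels, predicted_labels):
    """Estimate gamma as sum_l p_l * (1 - q_l), with p_l the observed and
    q_l the predicted proportion of class l (both over the same n samples)."""
    n = len(true_labels)
    observed = Counter(true_labels)
    predicted = Counter(predicted_labels)
    return sum((observed[c] / n) * (1.0 - predicted[c] / n) for c in observed)


def gap(gamma, b1):
    """Gap = gamma - B1, as in (29)."""
    return gamma - b1


# Toy, made-up labels: 6 cancer and 4 normal samples, 5 predicted as each class.
truth = ["cancer"] * 6 + ["normal"] * 4
preds = ["cancer"] * 5 + ["normal"] * 5

gamma = no_information_error(truth, preds)
print(gamma, gap(gamma, b1=0.1))
```

In this toy case γ is 0.6 × 0.5 + 0.4 × 0.5 = 0.5, so a bootstrap error of 0.1 gives a gap of 0.4; a classifier no better than chance would give a gap near zero.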



Table 2 Comparative analysis of different types of errors for μHEM algorithm

Microarray data sets   AE               B1 Error         γ Error          B.632+ Error
                       Error  miRNAs    Error  miRNAs    Error  miRNAs    Error  miRNAs

Table 3 Comparative performance analysis of different algorithms

Figure 11 Gap curve obtained using different methods on GSE28700 and GSE31408 data sets averaged over 50 random splits.






Figure 12 B.632+ errors of the SVM obtained using different methods on GSE17681 and GSE17846 data sets averaged over 50 random
splits.

Figure 13 B.632+ errors of the SVM obtained using different methods on GSE21036 and GSE24709 data sets averaged over 50 random
splits.

Figure 14 B.632+ errors of the SVM obtained using different methods on GSE28700 and GSE31408 data sets averaged over 50 random
splits.

B.632+ error

Finally, the performance of different algorithms is compared
with respect to the B.632+ error. According to
Efron and Tibshirani [37], the B.632+ error corrects the
upward bias in the bootstrap error with the downwardly
biased apparent error. Figures 12, 13, and 14 report the
variation of the B.632+ error for different numbers of
selected miRNAs obtained by several feature selection
algorithms on the six miRNA expression data sets. From the
results reported in Table 3 and Figures 12, 13, and 14,
it can be seen that both boosting and lasso are useful
for selecting a very small number of miRNAs, but do not
always achieve the lowest B.632+ error rate.
The μHEM algorithm attains the lowest B.632+ error rate of
the SVM classifier for the GSE17681, GSE21036, GSE24709,
and GSE31408 data sets, while boosting achieves it only
on the GSE17846 and GSE28700 data sets. The better
performance of the proposed μHEM method is attributed to
the fact that it provides an efficient way to compute the degree
of dependency of class labels on a feature set in
approximation spaces. In effect, a reduced set of relevant and
significant miRNAs is obtained using the proposed
μHEM method.
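
How the B.632+ estimate combines the apparent error (AE), the bootstrap error (B1), and the no-information error (γ) can be sketched as below. The formula follows the .632+ rule of Efron and Tibshirani [37] in its weighted form, with B1 clipped at γ and the relative overfitting rate set to zero when there is no overfitting. The function name `b632_plus` and the example numbers are illustrative assumptions, not values from the paper.

```python
def b632_plus(ae, b1, gamma):
    """Weighted .632+ estimate: (1 - w) * AE + w * B1, where
    w = 0.632 / (1 - 0.368 * R) and R is the relative overfitting rate."""
    b1_clipped = min(b1, gamma)  # B1 is not allowed to exceed gamma
    if b1_clipped > ae and gamma > ae:
        r = (b1_clipped - ae) / (gamma - ae)  # R lies in (0, 1]
    else:
        r = 0.0  # no measurable overfitting
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * ae + w * b1_clipped


# Made-up example: perfect apparent error, modest bootstrap error.
print(b632_plus(ae=0.0, b1=0.1, gamma=0.5))
```

With R = 0 the weight is 0.632, which recovers the ordinary .632 estimate; with R = 1 the weight is 1 and the estimate equals B1. This is the sense in which more weight is placed on the bootstrap error as overfitting grows.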

Execution time 

Moreover, Figure 15 compares the execution time of
different algorithms for the six data sets. From the results
reported in Figure 15, it can be seen that the execution
time of the proposed algorithm is significantly lower
than that of most of the methods, irrespective of the data
set used. However, the execution time of the method
proposed by Golub et al. is slightly lower than that of the
proposed method. The lower execution time of the
proposed algorithm is attributed to the low computational
complexity of computing the relevance and significance
with respect to the number of selected miRNAs, the total
number of miRNAs, and the number of samples in the
microarray data set.

Figure 15 Execution time of different algorithms on six miRNA
expression data sets.



Biological significance analysis 

This section presents the biological significance of some
of the miRNAs that are selected by the proposed μHEM
algorithm for the GSE21036 data set as an example. The manually
curated database miR2Disease [50] is used
here to biologically validate the results obtained by the
μHEM algorithm. This database aims at providing a
comprehensive resource of miRNA deregulation in various
human diseases.

In the GSE21036 data set, miRNA expression profiling has
been done to understand the role of the miRNAs that are
responsible for the genesis and progression of prostate
cancer [40]. The μHEM algorithm selects a set of
differentially expressed miRNAs from each bootstrap sample of
the GSE21036 data set. A set of nine miRNAs, consisting of
hsa-miR-145, hsa-miR-25, hsa-miR-153, hsa-miR-143,
hsa-miR-19a, hsa-miR-96, hsa-miR-663, hsa-miR-20a,
and hsa-miR-182, is identified from all bootstrap
samples of the GSE21036 data set. Among them, four miRNAs,
namely hsa-miR-19a, hsa-miR-20a, hsa-miR-663, and
hsa-miR-182, are identified by the μHEM algorithm only,
not by the other feature selection algorithms.
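
Identifying miRNAs that are selected in every bootstrap sample amounts to intersecting the per-sample selections. A minimal Python sketch is given below; the helper `common_to_all` and the three toy selections are hypothetical and only reuse miRNA names mentioned above, they are not the actual bootstrap results.

```python
from functools import reduce


def common_to_all(selections):
    """Return the items that appear in every selection (empty set if none given)."""
    sets = [set(s) for s in selections]
    if not sets:
        return set()
    return reduce(set.intersection, sets)


# Made-up selections from three bootstrap samples, for illustration only.
bootstrap_selections = [
    ["hsa-miR-145", "hsa-miR-25", "hsa-miR-143", "hsa-miR-663"],
    ["hsa-miR-145", "hsa-miR-143", "hsa-miR-96", "hsa-miR-25"],
    ["hsa-miR-143", "hsa-miR-145", "hsa-miR-25", "hsa-miR-182"],
]

stable = common_to_all(bootstrap_selections)
print(sorted(stable))
```

Requiring presence in all samples is a strict stability criterion; a looser variant would keep miRNAs that appear in at least a chosen fraction of the samples.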

One of the distinct characteristics of prostate cancer
is over-expression of the ERG proto-oncogene. Several
independent target prediction methods have indicated
that the 3′ untranslated region of the ERG mRNA is
a potential target of hsa-miR-145. The hsa-miR-145
is consistently down-regulated in prostate cancer. In
[51], it has been shown that the ERG 3′ untranslated
region is a regulative target of hsa-miR-145 in vitro.
From this observation it is suggested that the down-regulation
of hsa-miR-145 contributes to the progression of prostate cancer.
The down-regulation of hsa-miR-145 is also mentioned
in [52,53].

In [54], it has been shown that hsa-miR-20a is
over-expressed in prostate cancer. Moreover, Sylvestre
et al. described an over-expression of hsa-miR-20a in
the human prostate cancer cell line PC3 using PCR [55].
Volinia et al. recorded an up-regulation of hsa-miR-20a
in prostate cancer tissue using a microarray assay [56].
The identified function of hsa-miR-20a is the modulation
of the translation of the E2F2 and E2F3 mRNAs via
binding sites in their 3′ untranslated regions [55], which
supports the oncogenic behavior of hsa-miR-20a. The
over-expression of hsa-miR-20a reduces apoptosis in
the prostate cancer cell line [55]. As suggested in [56]
and miR2Disease, hsa-miR-25 is also up-regulated in
prostate cancer.

In [57,58], it is shown that hsa-miR-143 expression
is clearly down-regulated during prostate cancer
progression. ERK5 is known to promote cell growth and
proliferation in response to growth factors and tyrosine
kinase activation. Therefore, persistently decreased
levels of hsa-miR-143 in cancer cells may be directly
involved in carcinogenesis through activation of the
mitogen-activated protein kinase (MAPK) cascade via
ERK5. Taken together, these findings suggest that
hsa-miR-143 could be a tumor suppressor and a potential
novel diagnostic or prognostic marker in prostate
cancer.

According to Hirata et al. [59], hsa-miR-182 is
over-expressed in prostate cancer and regulates the FOXF2,
RECK, and MTSS1 genes. They have also shown
experimentally that these three genes are potential targets
of hsa-miR-182 and play an important role in the progression
of prostate cancer. Another miRNA, hsa-miR-96,
is shown to be over-expressed in prostate cancer, as
mentioned in [60].

Conclusion 

The contribution of the paper is twofold, namely,

1. the development of the μHEM algorithm for miRNA
selection, integrating the merits of rough sets and
the hypercuboid equivalence partition matrix; and

2. the demonstration of the effectiveness of the proposed
algorithm, along with a comparison with other
algorithms, on several real-life miRNA expression
data sets.

The concept of the hypercuboid equivalence partition
matrix is found to be successful in selecting relevant and
significant miRNAs from real-valued microarray data sets.
This formulation is geared towards maximizing the utility
of rough sets and the hypercuboid approach with respect to
in silico identification of differentially expressed miRNAs.
The results obtained on six miRNA data sets demonstrate
that the proposed method can bring a remarkable
improvement to the miRNA selection problem, and therefore
it can be a promising alternative to existing models
for the prediction of class labels of samples. All the results
reported in this paper demonstrate the feasibility and
effectiveness of the proposed method. The new method
is capable of identifying effective miRNAs that may
contribute to revealing the underlying etiology of a disease,
providing a useful tool for exploratory analysis of miRNA
data.

Availability and requirements 

Project name: μHEM (Differentially expressed microRNA
selection method)

Project home page: www.isical.ac.in/~bibl/results/mihem/mihem.html

Operating system: developed on Linux (Ubuntu 12.04
LTS)

Programming language: C

Competing interests

The authors declare that they have no competing interests. 



Authors' contributions 

SP designed the current work. PM developed the concept of rough set based
miRNA selection algorithm. SP implemented it and applied it to different miRNA
expression data sets. Both SP and PM analyzed the results and prepared the
manuscript. Both authors read and approved the final manuscript.

Acknowledgements

This work is partially supported by the Indian National Science Academy, New
Delhi (grant no. SP/YSP/68/2012). The work was done when one of the
authors, S. Paul, was a Senior Research Fellow of the Council of Scientific and
Industrial Research, Government of India.

Received: 18 March 2013 Accepted: 30 August 2013
Published: 4 September 2013

References 

1. Lu J, Getz G, Miska EA, Saavedra EA, Lamb J, Peck D, Cordero AS, Ebert BL,
Mak RH, Ferrando AA, Downing JR, Jacks T, Horvitz HR, Golub TR:
MicroRNA expression profiles classify human cancers. Nat Lett 2005,
435(9):834-838.

2. Budhu A, Ji J, Wang XW: The clinical potential of microRNAs. J Hematol
Oncol 2010, 3(37):1-7.

3. Lehmann U, Streichert T, Otto B, Albat C, Hasemeier B, Christgen H,
Schipper E, Hille U, Kreipe HH, Langer F: Identification of differentially
expressed microRNAs in human male breast cancer. BMC
Bioinformatics 2010, 10:1-9.

4. Blenkiron C, Goldstein LD, Thorne NP, Spiteri I, Chin SF, Dunning MJ,
Barbosa-Morais NL, Teschendorff AE, Green AR, Ellis IO, Tavare S, Caldas C,
Miska EA: MicroRNA expression profiling of human breast cancer
identifies new markers of tumor subtype. Genome Biol 2007, 8:1-16.

5. Chen Y, Stallings RL: Differential patterns of microRNA expression in
neuroblastoma are correlated with prognosis, differentiation, and
apoptosis. Cancer Res 2007, 67:976-983.

6. Guo J, Miao Y, Xiao B, Huan R, Jiang Z, Meng D, Wang Y: Differential
expression of microRNA species in human gastric cancer versus
non-tumorous tissues. J Gastroenterol Hepatol 2009, 24:652-657.

7. Schrauder MG, Strick R, Schulz-Wendtland R, Strissel PL, Kahmann L,
Loehberg CR, Lux MP, Jud SM, Hartmann A, Hein A, Bayer CM, Bani MR,
Richter S, Adamietz BR, Wenkel E, Rauh C, Beckmann MW, Fasching PA:
Circulating micro-RNAs as potential blood-based markers for early
stage breast cancer detection. PLoS ONE 2012, 7:1-9.

8. Zhao H, Shen J, Medico L, Wang D, Ambrosone CB, Liu S: A pilot study of
circulating miRNAs as potential biomarkers of early stage breast
cancer. PLoS ONE 2010, 5(10):1-12.

9. Paul S, Maji P: Rough sets for in silico identification of differentially
expressed miRNAs. Int J Nanomedicine 2013, 8:1-12.

10. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the
basis of microarray gene-expression data. Proc Natl Acad Sci, USA
2002, 99(10):6562-6566.

11. Iorio MV, Visone R, Leva GD, Donati V, Petrocca F, Casalini P, Taccioli C,
Volinia S, Liu CG, Alder H, Calin GA, Menard S, Croce CM: MicroRNA
signatures in human ovarian cancer. Cancer Res 2007,
67(18):8699-8707.

12. Li S, Chen X, Zhang H, Liang X, Xiang Y, Yu C, Zen K, Li Y, Zhang CY:
Differential expression of microRNAs in mouse liver under aberrant
energy metabolic status. J Lipid Res 2009, 50:1756-1765.

13. Nasser S, Ranade AR, Sridhart S, Haney L, Korn RL, Gotway MB, Weiss GJ,
Kim S: Identifying miRNA and imaging features associated with
metastasis of lung cancer to the brain. In Proceedings of the 3rd IEEE
International Conference on Bioinformatics and Biomedicine. Washington;
2009:246-251.

14. Ortega FJ, Moreno-Navarrete JM, Pardo G, Sabater M, Hummel M, Ferrer
A, Rodriguez-Hermosa JI, Ruiz B, Ricart W, Peral B, Real JMF: MiRNA
expression profile of human subcutaneous adipose and during
adipocyte differentiation. PLoS ONE 2010, 5(2):1-9.

15. Pereira PM, Marques JP, Soares AR, Carreto L, Santos MAS: MicroRNA
expression variability in human cervical tissues. PLoS ONE 2010,
5(7):1-12.

16. Raponi M, Dossey L, Jatkoe T, Wu X, Chen G, Fan H, Beer DG: MicroRNA
classifiers for predicting prognosis of squamous cell lung cancer.
Cancer Res 2009, 69(14):5776-5783.






17. Arora S, Ranade AR, Tran NL, Nasser S, Sridhar S, Korn RL, Ross JTD, Dhruv
H, Foss KM, Sibenaller Z, Ryken T, Gotway MB, Kim S, Weiss GJ:
MicroRNA-328 is associated with Non-Small Cell Lung Cancer
(NSCLC) brain metastasis and mediates NSCLC migration. Int J Cancer
2011, 129(11):2621-2631.

18. McIver AD, East P, Mein CA, Cazier JB, Molloy G, Chaplin T, Lister TA,
Young BD, Debernardi S: Distinctive patterns of microRNA expression
associated with karyotype in acute myeloid leukaemia. PLoS ONE
2008, 3(5):1-8.

19. Wang C, Yang S, Sun G, Tang X, Lu S, Neyrolles O, Gao Q: Comparative
miRNA expression profiles in individuals with latent and active
tuberculosis. PLoS ONE 2011, 6(10):1-11.

20. Zhu M, Yi M, Kim CH, Deng C, Li Y, Medina D, Stephens RM, Green JE:
Integrated miRNA and mRNA expression profiling of mouse
mammary tumor models identifies miRNA signatures associated
with mammary tumor lineage. Gen Biol 2011, 12:1-16.

21. Xu R, Xu J, Wunsch DC: MicroRNA expression profile based cancer
classification using default ARTMAP. Neural Netw 2009, 22:774-780.

22. Pawlak Z: Rough Sets: Theoretical Aspects of Reasoning About Data.
Dordrecht: Kluwer; 1991.

23. Maji P, Pal SK: Rough-Fuzzy Pattern Recognition: Applications in
Bioinformatics and Medical Imaging. New Jersey: Wiley-IEEE Computer
Society Press; 2012.

24. Fang J, Busse JWG: Mining of microRNA expression data: a rough
set approach. In Proceedings of the 1st International Conference on Rough
Sets and Knowledge Technology. Berlin, Heidelberg: Springer;
2006:758-765.

25. Maji P: Fuzzy-rough supervised attribute clustering algorithm and
classification of microarray data. IEEE Trans Syst, Man, Cybern, Part B:
Cybern 2011, 41:222-233.

26. Maji P, Pal SK: Fuzzy-rough sets for information measures and
selection of relevant genes from microarray data. IEEE Trans Syst, Man,
and Cybern, Part B: Cybern 2010, 40(3):741-752.

27. Maji P, Paul S: Microarray time-series data clustering using
rough-fuzzy C-means algorithm. In Proceedings of the 5th IEEE
International Conference on Bioinformatics and Biomedicine. Atlanta;
2011:269-272.

28. Maji P, Paul S: Rough set based maximum relevance-maximum
significance criterion and gene selection from microarray data.
Int J Approximate Reasoning 2011, 52(3):408-426.

29. Maji P, Paul S: Rough-fuzzy clustering for grouping functionally
similar genes from microarray data. IEEE/ACM Trans Comput Biol
Bioinformatics 2013. doi:10.1109/TCBB.2012.103.

30. Paul S, Maji P: Robust RFCM algorithm for identification of
co-expressed miRNAs. In Proceedings of the 6th IEEE International
Conference on Bioinformatics and Biomedicine. Philadelphia; 2012:520-523.

31. Paul S, Maji P: Rough sets and support vector machine for selecting
differentially expressed miRNAs. In Proceedings of the 6th IEEE
International Conference on Bioinformatics and Biomedicine Workshops:
Nanoinformatics for Biomedicine. Philadelphia; 2012:864-871.

32. Slezak D: Rough sets and few-objects-many-attributes problem: the
case study of analysis of gene expression data sets. In Proceedings of
the Frontiers in the Convergence of Bioscience and Information Technologies.
Cheju Island: IEEE Computer Society; 2007:233-240.

33. Slezak D, Wroblewski J: Roughfication of numeric decision tables: the
case study of gene expression data. In Proceedings of the 2nd
International Conference on Rough Sets and Knowledge Technology. Berlin,
Heidelberg: Springer; 2007:316-323.

34. Valdes JJ, Barton AJ: Relevant attribute discovery in high dimensional
data: application to breast cancer gene expressions. In Proceedings of
the 1st International Conference on Rough Sets and Knowledge Technology.
Berlin: Springer; 2006:482-489.

35. Maji P, Paul S: Robust rough-fuzzy C-means algorithm: design and
applications in coding and non-coding RNA expression data
clustering. Fundam Informaticae 2013, 124(1-2):153-174.

36. Wei JM, Wang SQ, Yuan XJ: Ensemble rough hypercuboid approach
for classifying cancers. IEEE Trans Knowl Data Eng 2010, 22(3):381-391.

37. Efron B, Tibshirani R: Improvements on cross-validation: the .632+
bootstrap method. J Am Stat Assoc 1997, 92(438):548-560.

38. Keller A, Leidinger P, Wendschlag A, Scheffler M, Meese E, Wucherpfennig
F, Huwer H, Borries A: miRNAs in lung cancer - studying complex
fingerprints in patient's blood cells by microarray experiments.
BMC Cancer 2009, 9:353.

39. Keller A, Leidinger P, Lange J, Borries A, Schroers H, Scheffler M, Lenhof
HP, Ruprecht K, Meese E: Multiple sclerosis: MicroRNA expression
profiles accurately differentiate patients with relapsing-remitting
disease from healthy controls. PLoS ONE 2009, 4(10):e7440.

40. Taylor BS, Schultz N, Hieronymus H, Gopalan A, Xiao Y, Carver BS, Arora VK,
Kaushik P, Cerami E, Reva B, Antipin Y, Mitsiades N, Landers T, Dolgalev I,
Major JE, Wilson M, Socci ND, Lash AE, Heguy A, Eastham JA, Scher HI,
Reuter VE, Scardino PT, Sander C, Sawyers CL, Gerald WL: Integrative
genomic profiling of human prostate cancer. Cancer Cell 2010,
18:11-22.

41. Tseng CW, Lin CC, Chen CN, Huang HC, Juan HP: Integrative network
analysis reveals active microRNAs and their functions in gastric
cancer. BMC Syst Biol 2011, 5:99.

42. Ralfkiaer U, Hagedorn PH, Bangsgaard N, Lovendorf MB, Ahler CB,
Svensson L, Kopp KL, Vennegaard MT, Lauenborg B, Zibert JR, Krejsgaard
T, Bonefeld CM, Sokilde R, Gjerdrum LM, Labuda T, Mathiesen AM,
Gronbaek K, Wasik MA, Sokolowska-Wojdylo M, Queille-Roussel C,
Gniadecki R, Ralfkiaer E, Geisler C, Litman T, Woetmann A, Glue C, Ropke
MA, Skov L, Odum N: Diagnostic microRNA profiling in cutaneous
T-cell lymphoma (CTCL). Blood 2011, 118(22):5891-5900.

43. Vapnik V: The Nature of Statistical Learning Theory. New York:
Springer-Verlag; 1995.

44. Quinlan JR: C4.5: Programs for Machine Learning. CA: Morgan Kaufmann;
1993.

45. Ding C, Peng H: Minimum redundancy feature selection from
microarray gene expression data. J Bioinformatics Comput Biol 2005,
3(2):185-205.

46. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP,
Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES:
Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science 1999, 286:531-537.

47. Buelmann P, Yu B: Boosting with the L2 loss: regression and
classification. J Am Stat Assoc 2003, 98:324-339.

48. Tibshirani R: Regression shrinkage and selection via the lasso.
J R Stat Soc B 1996, 58:267-288.

49. Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC,
Botstein D, Brown P: 'Gene Shaving' as a method for identifying
distinct sets of genes with similar expression patterns. Genome Biol
2000, 1(2):1-21.

50. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y:
miR2Disease: a manually curated database for microRNA
deregulation in human disease. Nucleic Acids Res 2009, 37:D98-D104.

51. Hart M, Wach S, Nolte E, Szczyrba J, Menon R, Taubert H, Hartmann A,
Stoehr R, Wieland W, Grasser FA, Wullich B: The proto-oncogene ERG is
a target of microRNA miR-145 in prostate cancer. FEBS J 2013,
280(9):2105-2116.

52. Ozen M, Creighton CJ, Ozdemir M, Ittmann M: Widespread
deregulation of microRNA expression in human prostate cancer.
Oncogene 2007, 27:1788-1793.

53. Wang L, Tang H, Thayanithy V, Subramanian S, Oberg AL, Cunningham
JM, Cerhan JR, Steer CJ, Thibodeau SN: Gene networks and microRNAs
implicated in aggressive prostate cancer. Cancer Res 2009,
69(24):9490-9497.

54. Pesta M, Klecka J, Kulda V, Topolcan O, Hora M, Eret V, Ludvikova M,
Babjuk M, Novak K, Stolz J, Holubec L: Importance of miR-20a
expression in prostate cancer tissue. Anticancer Res 2010,
30(9):3579-3583.

55. Sylvestre Y, De Guire V, Querido E, Mukhopadhyay UK, Bourdeau V,
Major F, Ferbeyre G, Chartrand P: An E2F/miR-20a autoregulatory
feedback loop. J Biol Chem 2007, 282(4):2135-2143.

56. Volinia S, Calin GA, Liu CG, Ambs S, Cimmino A, Petrocca F, Visone R,
Iorio M, Roldo C, Ferracin M, Prueitt RL, Yanaihara N, Lanza G, Scarpa A,
Vecchione A, Negrini M, Harris CC, Croce CM: A microRNA expression
signature of human solid tumors defines cancer gene targets. Proc
Natl Acad Sci, USA 2006, 103(7):2257-2261.

57. Clape C, Fritz V, Henriquet C, Apparailly F, Fernandez PL, Iborra F, Avances
C, Villalba M, Culine S, Fajas L: miR-143 interferes with ERK5 signaling,
and abrogates prostate cancer progression in mice. PLoS ONE 2009,
4(10):e7542.






58. Porkka KP, Pfeiffer MJ, Waltering KK, Vessella RL, Tammela TL, Visakorpi T:
MicroRNA expression profiling in prostate cancer. Cancer Res 2007,
67(13):6130-6135.

59. Hirata H, Ueno K, Shahryari V, Deng G, Tanaka Y, Tabatabai ZL, Hinoda Y,
Dahiya R: MicroRNA-182-5p promotes cell invasion and proliferation
by down regulating FOXF2, RECK and MTSS1 genes in human
prostate cancer. PLoS ONE 2013, 8(1):e55502.

60. Schaefer A, Jung M, Mollenkopf HJ, Wagner I, Stephan C, Jentzmik F,
Miller K, Lein M, Kristiansen G, Jung K: Diagnostic and prognostic
implications of microRNA profiling in prostate carcinoma.
Int J Cancer 2010, 126(5):1166-1176.



doi:10.1186/1471-2105-14-266

Cite this article as: Paul and Maji: μHEM for identification of differentially
expressed miRNAs using hypercuboid equivalence partition matrix. BMC
Bioinformatics 2013 14:266.






